"If you tell the truth, you don't have to remember anything", Mark Twain
Recent advances have made AI systems components of high-stakes decision processes. The nature of these decisions has driven the creation of algorithms, methods, and techniques to accompany outputs from AI systems with explanations. This drive is motivated in part by laws and regulations requiring that decisions, including those from automated systems, be accompanied by information about the logic behind them, and by the desire to create trustworthy AI. It is reasonable to assume that a system's failure to articulate the rationale for an answer can lower the level of trust users grant that system. Suspicions that the AI system is biased or unfair may slow acceptance of the technology. In terms of societal acceptance and trust, developers of AI systems may need to consider that multiple attributes of a system influence public perception of it. Explainability is one of several properties that characterize trust in AI systems; others include resiliency, reliability, bias, and accountability. As a side note, a related and already highly sought concept in machine learning is the generalization ability of a model.
XAI can answer questions such as: "Why was a particular answer obtained?"; "How was a particular answer obtained?"; "When can a particular AI-based system fail?". Thus, XAI can provide trustworthiness, transparency, confidence, and informativeness. The scope of XAI can be as broad as the scope of AI itself; in effect, XAI can be a post-hoc analysis of AI solving a particular problem. Some examples,
Intrinsic $\leftrightarrow$ Black Box Explanations. Intrinsically explainable (white-box, or interpretable) models include linear regression, logistic regression, and decision trees.
Local $\leftrightarrow$ Global Explanations. A variable may behave, or affect the prediction, differently in different parts of its scale. Example applications include cybersecurity, where global and local explanations are both needed.
Agnostic $\leftrightarrow$ Model Dependent. Model-agnostic XAI treats the model as a black box, without access to the model's internal parameters. The strength of agnostic methods is that they may be applied to any ML model to generate explanations. The order of the taxonomy is thus intrinsic (or not), then local (or global) explanation, then model-agnostic (or model-dependent).
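As a concrete instance of an intrinsically explainable model, a logistic regression exposes its reasoning directly through its coefficients; a minimal sketch (using sklearn's built-in breast cancer data, not this chapter's CSV):

```python
# An intrinsically explainable model: the learned coefficients of a
# logistic regression ARE its (global) explanation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

data = load_breast_cancer()
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(data.data, data.target)
coefs = clf[-1].coef_.ravel()
# Rank features by absolute coefficient magnitude
top = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:3]
for name, c in top:
    print(f'{name}: {c:+.2f}')
```

Standardizing first makes the coefficient magnitudes comparable across features.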
Let's use the breast cancer dataset to predict recurrence events. Helping the medical experts (SMEs) can take forms such as the following,
Let's use a Logistic Regression classifier.
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
from IPython.display import display
import numpy as np
import pandas as pd
import sys
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
import warnings
from sklearn.exceptions import ConvergenceWarning
from warnings import simplefilter
simplefilter(action='ignore', category=DeprecationWarning)
def bc_preprocess(_df):
    # Check if we have any '?' in df values
    print(_df.columns[_df.isin(['?']).any()])
    # Check if we have any NaN in df values
    print(_df.columns[_df.isnull().any()])
    # Drop the rows with '?'
    _df = _df[~_df['node-caps'].isin(['?']) & ~_df['breast-quad'].isin(['?'])]
    # See how many rows we lost
    print(f'#rows= {len(_df)} #columns= {len(_df.columns)}')
    _df.reset_index(inplace=True, drop=True)  # required after row deletion
    return _df
# Load data
df = pd.read_csv('../../../EP_datasets/breast_cancer_preprocessed.csv')
# Sanity check
print(f'#rows= {len(df)} #columns= {len(df.columns)}')
df.head()
#rows= 286 #columns= 10
| | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat | recurrence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | premeno | 18 | 0 | yes | 3 | right | left_up | no | recurrence-events |
| 1 | 58 | ge40 | 18 | 1 | no | 1 | right | central | no | no-recurrence-events |
| 2 | 51 | ge40 | 39 | 0 | no | 2 | left | left_low | no | recurrence-events |
| 3 | 49 | premeno | 35 | 1 | yes | 3 | right | left_low | yes | no-recurrence-events |
| 4 | 41 | premeno | 32 | 5 | yes | 2 | left | right_up | no | recurrence-events |
df = bc_preprocess(df)
Index(['node-caps', 'breast-quad'], dtype='object')
Index([], dtype='object')
#rows= 277 #columns= 10
# feature exploration
print(df['menopause'].unique())
print(df['node-caps'].unique())
print(df['breast'].unique())
print(df['breast-quad'].unique())
print(df['irradiat'].unique())
['premeno' 'ge40' 'lt40']
['yes' 'no']
['right' 'left']
['left_up' 'central' 'left_low' 'right_up' 'right_low']
['no' 'yes']
# treat binary variables as 0 and 1
df['node-caps'] = df['node-caps'].replace({'no':0, 'yes':1})
df['irradiat'] = df['irradiat'].replace({'no':0, 'yes':1})
def encode_onehot(_df, _f_target):
    _df = _df.copy()  # work on a copy, do not mutate the caller's frame
    # Convert all features of type object to one-hot encoded with pandas dummies
    for f in list(_df.columns.values):
        if _df[f].dtype == object:
            if f != _f_target:  # recurrence is the target variable and will be treated differently
                __df = pd.get_dummies(_df[f], prefix='',
                                      prefix_sep='').groupby(level=0, axis=1).max().add_prefix(f + ' - ')
                _df = pd.concat([_df, __df], axis=1)
                _df = _df.drop([f], axis=1)
    return _df
# One-hot encode the categorical features, leaving recurrence as the target
df = encode_onehot(df, 'recurrence')
# Sanity check
df.head()
| | age | tumor-size | inv-nodes | node-caps | deg-malig | irradiat | recurrence | menopause - ge40 | menopause - lt40 | menopause - premeno | breast - left | breast - right | breast-quad - central | breast-quad - left_low | breast-quad - left_up | breast-quad - right_low | breast-quad - right_up |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 18 | 0 | 1 | 3 | 0 | recurrence-events | False | False | True | False | True | False | False | True | False | False |
| 1 | 58 | 18 | 1 | 0 | 1 | 0 | no-recurrence-events | True | False | False | False | True | True | False | False | False | False |
| 2 | 51 | 39 | 0 | 0 | 2 | 0 | recurrence-events | True | False | False | True | False | False | True | False | False | False |
| 3 | 49 | 35 | 1 | 1 | 3 | 1 | no-recurrence-events | False | False | True | False | True | False | True | False | False | False |
| 4 | 41 | 32 | 5 | 1 | 2 | 0 | recurrence-events | False | False | True | True | False | False | False | False | False | True |
from sklearn.feature_selection import SelectPercentile, f_classif
X = df.loc[:, df.columns != 'recurrence'].values
y = df.loc[:, df.columns == 'recurrence'].values.ravel()
selector = SelectPercentile(f_classif, percentile=10)
# Fit the data
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
# Display
plt.figure(figsize=(10,5), dpi=72)
cols = [_ for _ in df.columns if _ != 'recurrence']
y_pos = np.arange(len(cols))
plt.bar(y_pos, scores)
plt.xticks(y_pos, cols, rotation=90, fontsize=12)
plt.grid(axis='y')
plt.title('Feature Ranks')
plt.show()
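The $-\log_{10}$ transform used above maps smaller p-values (stronger features) to larger scores before normalizing; a quick check on made-up p-values:

```python
import numpy as np

pvals = np.array([1e-6, 0.01, 0.5])  # very significant ... not significant
scores = -np.log10(pvals)
scores /= scores.max()               # normalize to [0, 1] as in the cell above
print(scores)                        # monotonically decreasing
```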
# Build a model
clf_lr = LogisticRegression(random_state=0, solver='sag', max_iter=10000).fit(X, y)
Local Interpretable Model-agnostic Explanations (LIME) is an explanation technique by Ribeiro et al. (see ref.) that explains the prediction of any classifier by learning an interpretable model locally around the prediction. LIME offers,
LIME explains a prediction so that experts and non-experts alike can compare models and improve an untrustworthy one through feature engineering. An ideal model explainer should have the following desirable properties,
Providing an explanation for a single data instance (a local explanation) is not always sufficient. A global understanding of the model, giving higher confidence in its robustness, can be achieved with the SP-LIME (Submodular Pick LIME) algorithm. The algorithm runs the explanations on multiple diverse sets of instances and returns non-redundant explanations, in a way attempting to generalize the explanations.
The algorithm accounts for the limited time available to go through all the individual local explanations. The number of explanations is the budget $B$, and the set of instances is $X$. The task of selecting $B$ instances to analyze for model explainability is called the pick step. The pick step is independent of the existence of the explanations, and it needs to provide non-redundant explanations by picking a diverse, representative set of instances that show how the model behaves from a global perspective.
The explanation matrix is $W \in \mathbb{R}^{n \times d'}$, where $n$ is the number of samples and $d'$ is the number of human-interpretable features. The global importance vector $I$ assigns to each feature $j$ a score $I_j$ representing its global importance in the explanation space. Intuitively, $I$ is formulated to assign higher scores to features that explain many instances of the data. The set of instances picked for the explanations is denoted by $V$. The algorithm then uses a non-redundant coverage intuition function, $c(V,W,I)$, which computes the collective importance of all features that appear in at least one instance in the set $V$. The pick problem becomes maximizing the weighted coverage function:
$\mathrm{Pick}(W,I)=\underset{V,\, |V| \le B}{\mathrm{arg\,max}} \ c(V,W,I)$
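A toy numpy sketch of the greedy pick (hypothetical helper names, not the lime library API), where the coverage $c(V,W,I)$ sums the global importance $I_j$ of every feature used by at least one picked instance:

```python
import numpy as np

def coverage(V, W, I):
    # c(V, W, I): total importance of features that get a non-zero weight
    # in at least one explanation among the picked instances V
    used = (np.abs(W[list(V)]) > 0).any(axis=0)
    return I[used].sum()

def sp_lime_pick(W, B):
    # Greedy submodular pick: toy re-implementation for illustration only
    I = np.sqrt(np.abs(W).sum(axis=0))  # global feature importances
    V = []
    for _ in range(B):
        gains = [(coverage(V + [i], W, I), i) for i in range(len(W)) if i not in V]
        V.append(max(gains)[1])         # instance with the largest coverage gain
    return V

# 4 instances x 3 interpretable features: rows are per-instance explanation weights
W = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
print(sp_lime_pick(W, B=2))
```

Note how the second pick avoids an instance redundant with the first: the two chosen instances cover different features.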
Now let's see how the LIME XAI library can help. Install the library with conda install lime.
Note that LIME uses the predict_proba method of sklearn learners.
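A quick reminder of that interface on toy data (not the chapter's dataset): predict_proba returns one row per sample, with class probabilities that sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D binary problem, just to show the interface LIME relies on
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
proba = LogisticRegression().fit(X_toy, y_toy).predict_proba(X_toy)
print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```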
Improvements: add a correlation matrix, an ROC curve.
from lime.lime_tabular import LimeTabularExplainer
from lime.lime_text import LimeTextExplainer
explainer = LimeTabularExplainer(X, mode='classification',
                                 feature_names=cols, class_names=df['recurrence'].unique().tolist(),
                                 categorical_features=None)
D_ID = 0
N_FEATURES= 7
exp = explainer.explain_instance(X[D_ID], clf_lr.predict_proba, num_features=N_FEATURES)
exp.show_in_notebook(show_table=True)
Classical feature pre-processing has methods of explanations:
As demonstrated below, Recurrence has a strong correlation to Age, Degree Malignant, etc.
import seaborn as sns; sns.set(style="ticks", color_codes=True)
# attempt correlation, need numericals
df_ = df.copy()
df_['recurrence'].replace({'recurrence-events':0, 'no-recurrence-events':1}, inplace=True)
df_ = df_[sorted(df_.columns)] # explanation requires some order
# Check variable correlations to infer about strong features
corr_df1, corr_df2 = df_.corr(method='pearson'), df_.corr(method='kendall')
# Draw a lower triangle matrix since correlation is symmetrical
mask1, mask2 = np.zeros_like(corr_df1, dtype=bool), np.zeros_like(corr_df2, dtype=bool)
mask1[np.triu_indices_from(mask1)], mask2[np.triu_indices_from(mask2)] = True, True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.set(style="white")
_, ax = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True, figsize=(12, 12), dpi=72)
ax = ax.flatten()
sns.heatmap(corr_df1, ax=ax[0], mask=mask1, cmap=cmap, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .25})
ax[0].set_title('Pearson Correlation')
sns.heatmap(corr_df2, ax=ax[1], mask=mask2, cmap=cmap, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .25})
ax[1].set_title('Kendall Rank Correlation')
plt.show()
# explore hyperparameter space
def run_classifier(_clf, _X, _y):
    tpr, fpr, acc = [], [], []
    kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for tr_i, ts_i in kf.split(_X, _y):
        _clf.fit(_X[tr_i], _y[tr_i])
        y_pred = _clf.predict(_X[ts_i])
        acc += [accuracy_score(_y[ts_i], y_pred)]
        # labels=[1,0]: class 0 (recurrence-events) plays the positive class
        tn, fp, fn, tp = confusion_matrix(_y[ts_i], y_pred, labels=[1,0]).ravel()
        tpr += [tp/(tp+fn)]  # probability of detection
        fpr += [fp/(fp+tn)]  # false alarm
    return np.array(acc), np.mean(tpr), np.mean(fpr)
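A hand check of the confusion-matrix unpacking above: with labels=[1,0] the usual tn, fp, fn, tp order is reversed, so class 0 is treated as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1])   # three positives (class 0), two negatives
y_pred = np.array([0, 0, 1, 1, 0])   # one miss, one false alarm
# labels=[1,0] reverses the matrix, making class 0 the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(tp, fn, fp, tn)                  # 2 hits, 1 miss, 1 false alarm, 1 correct rejection
print(tp / (tp + fn), fp / (fp + tn))  # TPR = 2/3, FPR = 1/2
```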
def run_params_logisticregression(_X, _y):
    Tpr, Fpr, Acc, Std = [], [], [], []
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        for solv in ('lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'):
            for invreg_c in (2e-1, 0.5, 0.8, 1, 2, 5, 1e1, 2e1, 1e2):
                clf = LogisticRegression(C=invreg_c, solver=solv, random_state=0, max_iter=3000)
                acc, tpr, fpr = run_classifier(clf, _X, _y)
                acc, std = np.mean(acc), np.std(acc)
                Acc += [acc]; Std += [std]; Tpr += [tpr]; Fpr += [fpr]
                # debug
                sys.stderr.write(f"\r LR acc={acc:.2f} {chr(177)}{std:.3f}"); sys.stderr.flush()
    return Tpr, Fpr, Acc, np.array(Std)
def run_params_randomforest(_X, _y):
    Tpr, Fpr, Acc, Std = [], [], [], []
    for ne in np.linspace(20, 100, num=3, dtype=int):
        for md in np.linspace(2, 8, num=3, dtype=int):
            for mf in np.linspace(3, 20, num=3, dtype=int):
                clf = RandomForestClassifier(n_estimators=ne, max_depth=md, max_features=mf, n_jobs=-1)
                acc, tpr, fpr = run_classifier(clf, _X, _y)
                acc, std = np.mean(acc), np.std(acc)
                Acc += [acc]; Std += [std]; Tpr += [tpr]; Fpr += [fpr]
                # debug
                sys.stderr.write(f"\r RF acc={acc:.2f} {chr(177)}{std:.3f}"); sys.stderr.flush()
    return Tpr, Fpr, Acc, np.array(Std)
def run_params_gradientboosting(_X, _y):
    Tpr, Fpr, Acc, Std = [], [], [], []
    for ne in np.linspace(10, 100, num=5, dtype=int):
        clf = GradientBoostingClassifier(n_estimators=ne)
        acc, tpr, fpr = run_classifier(clf, _X, _y)
        acc, std = np.mean(acc), np.std(acc)
        Acc += [acc]; Std += [std]; Tpr += [tpr]; Fpr += [fpr]
        # debug
        sys.stderr.write(f"\r GB acc={acc:.2f} {chr(177)}{std:.3f}"); sys.stderr.flush()
    return Tpr, Fpr, Acc, np.array(Std)
X = df_.loc[:, df_.columns != 'recurrence'].values
y = df_.loc[:, df_.columns == 'recurrence'].values.ravel()
Tpr1, Fpr1, Acc1, Std1 = run_params_logisticregression(X, y)
Tpr2, Fpr2, Acc2, Std2 = run_params_randomforest(X, y)
Tpr3, Fpr3, Acc3, Std3 = run_params_gradientboosting(X, y)
GB acc=0.72 ±0.079
# Size of the marker shows the error variance
COEF = 100
Std10 = (COEF*np.array(Std1))**2
Std20 = (COEF*np.array(Std2))**2
Std30 = (COEF*np.array(Std3))**2
def diagonal(_ax):
    _ax.plot(np.arange(0.001,1,0.01), np.arange(0.001,1,0.01), linestyle='--', color=(0.6, 0.6, 0.6), label='coin flip')
_, ax = plt.subplots(nrows=1, ncols=3, sharex=True, sharey=True, figsize=(14, 3), dpi=72)
ax = ax.flatten()
ax[0].scatter(Fpr1, Tpr1, s=Std10, marker='o', c='r', label='Logistic Regression', alpha=0.7)
diagonal(ax[0])
ax[0].set_ylabel('True Positive Rate')
ax[0].set_xlabel('False Positive Rate')
ax[0].legend()
ax[0].grid()
ax[1].scatter(Fpr2, Tpr2, s=Std20, marker='o', c='r', label='Random Forest', alpha=0.7)
diagonal(ax[1])
ax[1].set_xlabel('False Positive Rate')
ax[1].legend()
ax[1].grid()
ax[2].scatter(Fpr3, Tpr3, s=Std30, marker='o', c='r', label='Gradient Boosting', alpha=0.7)
diagonal(ax[2])
ax[2].set_xlabel('False Positive Rate')
ax[2].legend()
ax[2].grid()
plt.show()
The 20 newsgroups dataset comprises around 18k newsgroup posts on 20 topics, split into training and testing subsets. Let's build a model and differentiate between predictions with respect to features, in this case specific words from the vocabulary.
from sklearn.datasets import fetch_20newsgroups
# Explore the dataset
print(fetch_20newsgroups()['target_names'])
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
# Select two classes for the lecture
Categories = ['rec.motorcycles', 'misc.forsale']
# Build a model
ng_train = fetch_20newsgroups(subset='train', categories=Categories)
ng_test = fetch_20newsgroups(subset='test', categories=Categories)
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False,  # stop_words='english',
                                                             max_features=3000, dtype=np.float32)
X_train = vectorizer.fit_transform(ng_train.data)
X_test = vectorizer.transform(ng_test.data)
svc = SVC(kernel='linear', probability=True).fit(X_train, ng_train.target)
print(f'Linear SVC accuracy={accuracy_score(ng_test.target, svc.predict(X_test)):.2f}')
Linear SVC accuracy=0.98
rf = RandomForestClassifier(n_estimators=500).fit(X_train, ng_train.target)
print(f'RF accuracy={accuracy_score(ng_test.target, rf.predict(X_test)):.2f}')
RF accuracy=0.96
c1 = make_pipeline(vectorizer, svc)
print(c1.predict_proba([ng_test.data[0]]))
explainer1 = LimeTextExplainer(class_names=Categories)
[[0.07546738 0.92453262]]
c2 = make_pipeline(vectorizer, rf)
print(c2.predict_proba([ng_test.data[0]]))
explainer2 = LimeTextExplainer(class_names=Categories)
[[0.306 0.694]]
DOC_ID= 42
N_FEATURES= 7
# Explain features helped this document to be classified
exp1 = explainer1.explain_instance(ng_test.data[DOC_ID], c1.predict_proba, num_features=N_FEATURES)
print(f'Document id= {DOC_ID}')
print(f'True class= {Categories[ng_test.target[DOC_ID]]}')
print()
print(f'LinearSVC Prob {Categories[1]} = {c1.predict_proba([ng_test.data[DOC_ID]])[0,1]:.3f}')
print(', '.join([f"{w} {s:.3f}" for w,s in exp1.as_list()]))
exp2 = explainer2.explain_instance(ng_test.data[DOC_ID], c2.predict_proba, num_features=N_FEATURES)
print()
print(f'RF Prob {Categories[1]} = {c2.predict_proba([ng_test.data[DOC_ID]])[0,1]:.3f}')
print(', '.join([f"{w} {s:.3f}" for w,s in exp2.as_list()]))
Document id= 42
True class= misc.forsale

LinearSVC Prob misc.forsale = 0.993
Stafford 0.066, Winona 0.064, do 0.035, Re 0.030, shaft 0.027, All -0.015, University -0.011

RF Prob misc.forsale = 0.734
Re 0.121, In 0.094, article 0.090, do 0.048, wrote 0.046, what 0.038, shaft 0.024
exp1.show_in_notebook(text=True)
exp2.show_in_notebook(text=True)
# Effect on the prediction when the first two explaining features are removed
print(f'Original SVC prediction= {svc.predict_proba(X_test[DOC_ID])[0,1]:.3f}')
tmp = X_test[DOC_ID].copy()
tmp[0,vectorizer.vocabulary_[exp1.as_list()[0][0]]] = 0
tmp[0,vectorizer.vocabulary_[exp1.as_list()[1][0]]] = 0
print(f'Prob after removing feats ({exp1.as_list()[0][0]} {exp1.as_list()[1][0]})= {svc.predict_proba(tmp)[0,1]:.3f}')
print(f'Difference= {svc.predict_proba(X_test[DOC_ID])[0,1]-svc.predict_proba(tmp)[0,1]:.3f}')
print()
print(f'Original RF prediction= {rf.predict_proba(X_test[DOC_ID])[0,1]:.3f}')
tmp = X_test[DOC_ID].copy()
tmp[0,vectorizer.vocabulary_[exp2.as_list()[0][0]]] = 0
tmp[0,vectorizer.vocabulary_[exp2.as_list()[1][0]]] = 0
print(f'Prob after removing feats ({exp2.as_list()[0][0]} {exp2.as_list()[1][0]})= {rf.predict_proba(tmp)[0,1]:.3f}')
print(f'Difference= {rf.predict_proba(X_test[DOC_ID])[0,1]-rf.predict_proba(tmp)[0,1]:.3f}')
Original SVC prediction= 0.993
Prob after removing feats (Stafford Winona)= 0.926
Difference= 0.068

Original RF prediction= 0.734
Prob after removing feats (Re In)= 0.564
Difference= 0.170
Apriori analysis and association rule learning can be used to find patterns in datasets without building supervised or unsupervised learning models. This kind of analysis can help the data scientist and the subject-matter expert discuss the relevancy and correctness of the data at hand. Let's use the mlxtend (machine learning extensions) Python library. Install it with conda install mlxtend.
Let's revisit the breast cancer dataset. The procedure is to keep categorical variables as they are and discretize numerical variables into {low, medium, high}-like levels.
# Load and preprocess the data
df = pd.read_csv('../../../EP_datasets/breast_cancer_preprocessed.csv')
df = bc_preprocess(df)
df.head()
Index(['node-caps', 'breast-quad'], dtype='object')
Index([], dtype='object')
#rows= 277 #columns= 10
| | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat | recurrence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | premeno | 18 | 0 | yes | 3 | right | left_up | no | recurrence-events |
| 1 | 58 | ge40 | 18 | 1 | no | 1 | right | central | no | no-recurrence-events |
| 2 | 51 | ge40 | 39 | 0 | no | 2 | left | left_low | no | recurrence-events |
| 3 | 49 | premeno | 35 | 1 | yes | 3 | right | left_low | yes | no-recurrence-events |
| 4 | 41 | premeno | 32 | 5 | yes | 2 | left | right_up | no | recurrence-events |
# Check ranges and quantize numerical vars
df.hist(column=['age', 'tumor-size', 'inv-nodes', 'deg-malig'], bins=50, figsize=(13, 2), layout=(1,4))
plt.show()
# Convert variables to nominal
df.loc[df['age']<=40, 'dage'] = 'young'
df.loc[(df['age']>40)&(df['age']<=60), 'dage'] = 'aged'
df.loc[df['age']>60, 'dage'] = 'senior'
df.loc[df['tumor-size']<=20, 'dtumorsize'] = 'ts_small'
df.loc[(df['tumor-size']>20)&(df['tumor-size']<=40), 'dtumorsize'] = 'ts_medium'
df.loc[df['tumor-size']>40, 'dtumorsize'] = 'ts_large'
df.loc[df['inv-nodes']<5, 'dnodes'] = 'nodes_low'
df.loc[df['inv-nodes']>=5, 'dnodes'] = 'nodes_high'
df.loc[df['deg-malig']<=1, 'dmalig'] = 'malig_low'
df.loc[df['deg-malig']==2, 'dmalig'] = 'malig_med'
df.loc[df['deg-malig']==3, 'dmalig'] = 'malig_high'
df['node-caps'].replace({'no':'nc_no','yes':'nc_yes'}, inplace=True)
df['breast'].replace({'left':'br_left','right':'br_right'}, inplace=True)
df['irradiat'].replace({'no':'ir_no','yes':'ir_yes'}, inplace=True)
df.drop(columns=['age', 'tumor-size', 'inv-nodes', 'deg-malig'], inplace=True)
df.head()
| | menopause | node-caps | breast | breast-quad | irradiat | recurrence | dage | dtumorsize | dnodes | dmalig |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | premeno | nc_yes | br_right | left_up | ir_no | recurrence-events | aged | ts_small | nodes_low | malig_high |
| 1 | ge40 | nc_no | br_right | central | ir_no | no-recurrence-events | aged | ts_small | nodes_low | malig_low |
| 2 | ge40 | nc_no | br_left | left_low | ir_no | recurrence-events | aged | ts_medium | nodes_low | malig_high |
| 3 | premeno | nc_yes | br_right | left_low | ir_yes | no-recurrence-events | aged | ts_medium | nodes_low | malig_high |
| 4 | premeno | nc_yes | br_left | right_up | ir_no | recurrence-events | aged | ts_medium | nodes_high | malig_high |
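The manual thresholding above can equivalently be written with pd.cut; a minimal sketch on a toy age column, reusing the same cut points (40 and 60):

```python
import pandas as pd

# Toy ages; bins replicate the rules above: <=40 young, 41-60 aged, >60 senior
ages = pd.Series([35, 44, 58, 51, 49, 41, 63])
dage = pd.cut(ages, bins=[0, 40, 60, 200], labels=['young', 'aged', 'senior'])
print(dage.tolist())
```

pd.cut uses right-inclusive intervals by default, so an age of exactly 40 still falls into 'young', matching the `<=40` rule.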
# Build transactions
Vocabulary= [_ for col in df.columns for _ in df[col].unique().tolist()]
print(Vocabulary)
Vocix = {_:i for i,_ in enumerate(Vocabulary)}
N, M = len(df), len(Vocabulary)
Transactions = np.full((N,M), False, dtype=bool)
for i, row in df.iterrows():
    for col in df.columns:
        Transactions[i][Vocix[row[col]]] = True
# Sanity, debug, print row index 10, 11
# print(f'{Transactions[10:12].astype(int)}')
['premeno', 'ge40', 'lt40', 'nc_yes', 'nc_no', 'br_right', 'br_left', 'left_up', 'central', 'left_low', 'right_up', 'right_low', 'ir_no', 'ir_yes', 'recurrence-events', 'no-recurrence-events', 'aged', 'young', 'senior', 'ts_small', 'ts_medium', 'ts_large', 'nodes_low', 'nodes_high', 'malig_high', 'malig_low']
from mlxtend.frequent_patterns import apriori, association_rules
# Create one hot DataFrame
df = pd.DataFrame(Transactions, columns=Vocabulary)
df_apriori = apriori(df, min_support=0.05, use_colnames=True, max_len=10)
print(f'{len(df_apriori)} many itemsets')
df_apriori.tail()
4301 many itemsets
| | support | itemsets |
|---|---|---|
| 4296 | 0.061372 | (ir_no, aged, no-recurrence-events, nodes_low,... |
| 4297 | 0.050542 | (ir_no, aged, br_left, no-recurrence-events, n... |
| 4298 | 0.050542 | (ir_no, aged, br_left, no-recurrence-events, n... |
| 4299 | 0.064982 | (ir_no, aged, br_left, no-recurrence-events, n... |
| 4300 | 0.057762 | (ir_no, aged, no-recurrence-events, nodes_low,... |
# extract rules
df_rules = association_rules(df_apriori, metric='confidence', min_threshold=0.9, support_only=False)
print(f'{len(df_rules)} many rules')
df_rules.head()
3897 many rules
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (young) | (premeno) | 0.155235 | 0.537906 | 0.148014 | 0.953488 | 1.772592 | 0.064513 | 9.935018 | 0.515947 |
| 1 | (senior) | (ge40) | 0.180505 | 0.444043 | 0.180505 | 1.000000 | 2.252033 | 0.100353 | inf | 0.678414 |
| 2 | (nc_yes) | (malig_high) | 0.202166 | 0.761733 | 0.202166 | 1.000000 | 1.312796 | 0.048170 | inf | 0.298643 |
| 3 | (nodes_low) | (nc_no) | 0.826715 | 0.797834 | 0.754513 | 0.912664 | 1.143927 | 0.094932 | 2.314801 | 0.726077 |
| 4 | (nc_no) | (nodes_low) | 0.797834 | 0.826715 | 0.754513 | 0.945701 | 1.143927 | 0.094932 | 3.191336 | 0.622351 |
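The rule metrics are tied by simple arithmetic: confidence = support / antecedent support, and lift = confidence / consequent support. A quick check against row 0, the rule (young) → (premeno):

```python
# Recompute confidence and lift from the support columns of row 0 above
ant_sup, cons_sup, sup = 0.155235, 0.537906, 0.148014
confidence = sup / ant_sup        # support / antecedent support
lift = confidence / cons_sup      # confidence / consequent support
print(f'confidence={confidence:.6f} lift={lift:.6f}')
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate.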
def get_consequent(_df):
    _df['norecur'], _df['recur'] = False, False
    _df['recur'] = _df['consequents'].apply(lambda x: 'recurrence-events' in x and len(x)==1)
    _df['norecur'] = _df['consequents'].apply(lambda x: 'no-recurrence-events' in x and len(x)==1)
    return _df[_df['norecur']==True].copy(), _df[_df['recur']==True].copy()
df0, df1 = get_consequent(df_rules)
display(df0)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | norecur | recur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | (right_low, premeno) | (no-recurrence-events) | 0.054152 | 0.707581 | 0.050542 | 0.933333 | 1.319048 | 0.012225 | 4.386282 | 0.255725 | True | False |
| 54 | (malig_low, ge40) | (no-recurrence-events) | 0.111913 | 0.707581 | 0.104693 | 0.935484 | 1.322087 | 0.025505 | 4.532491 | 0.274320 | True | False |
| 99 | (right_low, nc_no) | (no-recurrence-events) | 0.064982 | 0.707581 | 0.061372 | 0.944444 | 1.334751 | 0.015392 | 5.263538 | 0.268226 | True | False |
| 114 | (ts_small, nc_no) | (no-recurrence-events) | 0.252708 | 0.707581 | 0.231047 | 0.914286 | 1.292128 | 0.052236 | 3.411552 | 0.302536 | True | False |
| 148 | (br_left, malig_low) | (no-recurrence-events) | 0.122744 | 0.707581 | 0.111913 | 0.911765 | 1.288565 | 0.025062 | 3.314079 | 0.255277 | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3768 | (ir_no, aged, br_left, nodes_low, nc_no, malig... | (no-recurrence-events) | 0.079422 | 0.707581 | 0.075812 | 0.954545 | 1.349026 | 0.019614 | 6.433213 | 0.281046 | True | False |
| 3782 | (ir_no, br_left, nodes_low, ts_small, nc_no, m... | (no-recurrence-events) | 0.054152 | 0.707581 | 0.054152 | 1.000000 | 1.413265 | 0.015835 | inf | 0.309160 | True | False |
| 3820 | (ir_no, aged, nodes_low, ts_small, nc_no, left... | (no-recurrence-events) | 0.061372 | 0.707581 | 0.057762 | 0.941176 | 1.330132 | 0.014336 | 4.971119 | 0.264423 | True | False |
| 3831 | (ir_no, aged, nodes_low, nc_no, malig_low, lef... | (no-recurrence-events) | 0.064982 | 0.707581 | 0.061372 | 0.944444 | 1.334751 | 0.015392 | 5.263538 | 0.268226 | True | False |
| 3845 | (ir_no, aged, nodes_low, ts_small, nc_no, mali... | (no-recurrence-events) | 0.057762 | 0.707581 | 0.057762 | 1.000000 | 1.413265 | 0.016891 | inf | 0.310345 | True | False |
195 rows × 12 columns
display(df1)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | norecur | recur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 828 | (nodes_high, br_left, ir_yes) | (recurrence-events) | 0.054152 | 0.292419 | 0.050542 | 0.933333 | 3.19177 | 0.034707 | 10.613718 | 0.726009 | False | True |
| 2163 | (nodes_high, malig_high, br_left, ir_yes) | (recurrence-events) | 0.054152 | 0.292419 | 0.050542 | 0.933333 | 3.19177 | 0.034707 | 10.613718 | 0.726009 | False | True |
Exercise 1. Find out if the LIME results support the LR feature ranking in the breast cancer problem.
Exercise 2. According to LIME, which classifier is more robust in the newsgroups problem?
Exercise 3. Why is it that the correlation between Recurrence and Age is reddish (as opposed to blueish)?
Exercise 4. Determine a rule with 'ts_large' and find out its highest support and confidence.
%%html
<style>
table {margin-left: 0 !important;}
</style>
<!-- Display markdown tables left oriented in this notebook. -->